Text Classification Using Association Rules, Dependency Pruning and Hyperonymization
We present new methods for pruning and enhancing itemsets for text
classification via association rule mining. Pruning methods are based on
dependency syntax and enhancing methods are based on replacing words by their
hyperonyms of various orders. We discuss the impact of these methods, compared
to pruning based on the tfidf rank of words.
Comment: 16 pages, 2 figures, presented at DMNLP 201
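The tfidf-rank baseline mentioned in this abstract can be illustrated with a minimal sketch. It assumes documents are already tokenized, uses the plain tf × log(N/df) weighting, and keeps the top-ranked words of each document as its itemset. The function name `tfidf_prune`, the `keep` parameter, and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def tfidf_prune(docs, keep):
    """Keep the `keep` highest-tfidf words of each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    pruned = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        # Rank by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        pruned.append(set(ranked[:keep]))
    return pruned

# Toy corpus (hypothetical): "the" occurs everywhere, so its tfidf is zero.
docs = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "chased", "the", "ball"],
    ["the", "market", "fell"],
]
itemsets = tfidf_prune(docs, keep=2)
```

The resulting itemsets would then be handed to an association rule miner. The dependency-based pruning and hyperonymization described in the abstract are not reproduced here.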
Thematically Reinforced Explicit Semantic Analysis
We present an extended, thematically reinforced version of Gabrilovich and
Markovitch's Explicit Semantic Analysis (ESA), where we obtain thematic
information through the category structure of Wikipedia. For this we first
define a notion of categorical tfidf which measures the relevance of terms in
categories. Using this measure as a weight we calculate a maximal spanning tree
of the Wikipedia corpus considered as a directed graph of pages and categories.
This tree provides us with a unique path of "most related categories" between
each page and the top of the hierarchy. We reinforce tfidf of words in a page
by aggregating it with categorical tfidfs of the nodes of these paths, and
define a thematically reinforced ESA semantic relatedness measure which is more
robust than standard ESA and less sensitive to noise caused by out-of-context
words. We apply our method to the French Wikipedia corpus, evaluate it on a
text classification task over a 37.5 MB corpus of 20 French newsgroups, and
obtain a precision increase of 9-10% compared with standard ESA.
Comment: 13 pages, 2 figures, presented at CICLing 201
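The idea of a unique path of "most related categories" can be sketched as follows. This is a simplification under stated assumptions: each page or category keeps only its highest-weighted parent edge, which gives a maximal spanning tree only when the page/category graph is acyclic. The paper's actual construction and its categorical tfidf weights may differ. The names `best_parent_tree` and `path_to_top`, and the toy weights, are hypothetical.

```python
def best_parent_tree(parent_weights):
    """Keep, for every node, only its highest-weighted parent edge."""
    return {
        node: max(parents, key=lambda p: (parents[p], p))
        for node, parents in parent_weights.items()
        if parents
    }

def path_to_top(node, tree):
    """Follow the kept edges from a node up to the top of the hierarchy."""
    path = [node]
    seen = {node}
    while path[-1] in tree:
        nxt = tree[path[-1]]
        if nxt in seen:  # guard against cycles in the input graph
            break
        path.append(nxt)
        seen.add(nxt)
    return path

# Toy graph (hypothetical): node -> {parent category: weight}.
parent_weights = {
    "Page:Chopin": {"Cat:Composers": 0.8, "Cat:Poland": 0.3},
    "Cat:Composers": {"Cat:Music": 0.9},
    "Cat:Poland": {"Cat:Countries": 0.5},
    "Cat:Music": {"Cat:Root": 0.4},
    "Cat:Countries": {"Cat:Root": 0.2},
}
path = path_to_top("Page:Chopin", best_parent_tree(parent_weights))
```

In the approach described above, the tfidf of a word in a page would then be aggregated with the categorical tfidfs of the nodes on this path. The aggregation formula is not given in the abstract, so it is left out of the sketch.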
Indica, an Indic preprocessor for TeX. A Sinhalese TeX System
In this paper a two-fold project is described. The first part is a generalized preprocessor for Indic scripts (scripts of languages currently spoken in India, except Urdu, as well as Sanskrit and Tibetan), with several kinds of input (LaTeX commands, 7-bit ASCII, CSX, Unicode) and TeX output. This utility is written in standard Flex (the GNU version of Lex), and hence can be painlessly compiled on any platform. The same input methods are used for all Indic languages, so that the user does not need to memorize different conventions and commands for each one of them. Moreover, the switch from one language to another can be done by means of user-definable preprocessor directives. The second part is a complete TeX typesetting system for Sinhalese. The design of the fonts is described, and METAFONT-related features, such as metaness and optical correction, are discussed. At the end of the paper, the reader can find tables showing the different input methods for the four Indic scripts currently implemented in Indica: Devanagari, Tamil, Malayalam, Sinhalese.
Les mathématiques de la langue : l'approche formelle de Montague
We present a method for modeling natural language that relies strongly
on mathematics. This method, called "Formal Semantics," was initiated by
the American linguist Richard M. Montague in the 1970s. It uses mathematical
tools such as formal languages and grammars, first-order logic, type theory and
λ-calculus. Our goal is to have the reader discover both Montagovian
formal semantics and the mathematical tools that Montague used in his method.
Comment: 14 pages, in French. Will appear in the journal Quadrature
(http://www.quadrature.info) in 201
Unicode, XML, TEI, Ω and Scholarly Documents
The Khmer Script Tamed by the Lion (of TeX)
This paper presents a Khmer typesetting system based on TeX, METAFONT, and an ANSI C filter. A 128-character portion of the 8-bit ASCII table is proposed for the Khmer script. Text is input phonically (in the spoken order: consonant, subscript consonant, second subscript consonant, vowel, diacritic). The filter converts the phonic description of consonantal clusters into a graphic TeXnical description of them. Thanks to TeX booleans, independent vowels can be automatically decomposed according to recent reforms of Khmer spelling. The last section presents a forthcoming implementation of Khmer in a 16-bit TeX output font, which solves the kerning problem of consonantal clusters.
Virtual Fonts: Great Fun, Not for Wizards Only
The Traditional Arabic Typecase, Unicode, TeX and METAFONT